Healing Length and Bubble Formation in DNA 
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We have recently suggested that the probability for the formation of thermally activated DNA 
bubbles is, to a very good approximation, proportional to the number of soft AT pairs over a length 
L(n) that depend on the size n of the bubble and on the temperature of the DNA. Here we clarify 
the physical interpretation of this length by relating it to the (healing) length that is required for 
the effect of a base-pair defect to become neligible. This provides a simple criteria to calculate L(n) 
for bubbles of arbitrary size and for any temperature of the DNA. We verify our findings by exact 
calculations of the equilibrium statistical properties of the Peyrard-Bishop-Dauxois model. Our 
method permits calculations of equilibrium thermal openings with several order of magnitude less 
numerical expense as compared with direct evaluations. 



I. INTRODUCTION. 

Local separation of double-stranded DNA into single- 
stranded DNA is fundamental to transcription and other 
important intra-cellular processes in living organisms. 
In equilibrium, DNA will locally denaturate when the 
free energy of the separated single-stranded DNA is less 
than that of the double-stranded DNA. Because of the 
larger entropy of the flexible single-strand, the more rigid 
double-strand can be thermally destabilized locally to 
form temporary "bubbles" in the molecule already at 
physiological temperatures Q . Considering this entropic 
effect together with the inherent energetic heterogene- 
ity - GC base pairs are 25 % more strongly bound than 
the AT bases - of a DNA sequence, it is plausible that 
certain regions (subsequences) are more prone to such 
thermal destabilization than others. In fact recent work 
demonstrates not only that such a phenomena exists 
but more importantly that the location of these large 
bubble openings in a variety of DNA sequences coincide 
with sites active during transcription events. This dis- 
covery represent a significant advance in the understand- 
ing of the relationship between local conformation and 
function in bio-molecules. While there is no guarantee 
that this mechanism applies to all transcription initia- 
tion events, the present agreement is very encouraging. 
Similarly, other large bubbles identified may well have a 
relationship to other biological functions. The agreement 
is based on the Peyrard-Bishop-Dauxois (PBD) model, 
Q , which evidently contains some essential basic ingredi- 
ents -local constraints (nonlincarity), base-pair sequence 
(colored disorder) and entropy (temperature). The equi- 
librium thermodynamic properties of the model were nu- 
merically calculated from the partition function using the 



transfer integral operator method (TIO). (A complemen- 
tary direct numerical evaluation of the partition function 
has been reported in Ref. |f|). This allows the precise 
evaluation of probabilities of bubbles as a function of 
temperature, location in a given base-pair sequence, and 
bubble size. In recent work we reported that the 
probabilities of finding bubbles extending over n sites 
do not depend on a specific DNA subsequences. Rather, 
such probabilities depend on the density of soft A/T base 
pairs within a region of length L(n) . Here we suggest that 
this characteristic length is simply related to the charac- 
teristic distance away from an AT base pair - consid- 
ered as a defect placed in a homogeneous GC-sequence- 
where the probability values of the base pairs return to 
the GC bulk-value. Lastly, based on this concept of ef- 
fective density approximation, we examine five different 
human promoter sequences, and demonstrate the striking 
agreement in the predictions from the two methods. 



II. THE PBD MODEL AND THE TIO METHOD. 

The potential energy of the PBD model is 

N N 
ri—l n— 1 

■ (i) 

Here V(y n ) — D n (e~ anVn — l) 2 , represents the nonlin- 
ear hydrogen bonds between the bases; W(y n ,y n -i) = 
| (l + pe-^+f"- 1 )) (y n - y n -i) 2 is the nearest- 
neighbor coupling that represents the (nonlinear) stack- 
ing interaction between adjacent base pairs: it is com- 
prised of a harmonic coupling with a state-dependent 
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coupling constant effectively modeling the change in stiff- 
ness as the double strand is opened (i.e. entropic effects). 
The sum in Eq. is over all base-pairs of the molecule 
and y n denotes the relative displacement from equilib- 
rium bases at the n th base pair. The importance of the 
heterogeneity of the sequence is incorporated by assign- 
ing different values to the parameters of the Morse poten- 
tial, depending on the the base-pair type. The parameter 
values we have used are those in Refs. 0, |t| chosen to 
reproduce a variety of experimentally observed thermo- 
dynamic properties. 

The equilibrium thermodynamic properties of the PBD 
model can be calculated from the partition function 

Z = [ n dy n e-^^ 

„ s+fc — 1 

= / II dy n Z k (s)e-^y—\ (2) 

^ n=s 

where we have introduced the notation 

N 

Zk{s) = J J] dy n e-P £ ^>y^ 

and (3 = (fc^T) -1 is the Boltzmann factor, fn order to 
evaluate the partition function @ using the TIO method, 
we first symmetrize e~^ £t ^ x,y ^ by introducing Q 

S(x,y) = exp(-^(V(x)+V(y) + 2W(x,y)) S j 
= S(y,x). 

Here the second equality holds only when x and y corre- 
spond to base-pairs of the same kind. Using Eq. J5J the 
expression for Z k (s) is rewritten as 




xdy e-^ v ^e-^ v ^\ (3) 

where open boundary conditions at n = 1, and n = N 
have been used. To proceed, a Fredholm integral equa- 
tions with a real symmetric kernel 

J dyS(x,y)<P(y) = \cf>(x) (4) 

must be solved separately for the A/T and for the G/C 
base-pairs. 

Since the eigenvalues are orthonormal and the eigen- 
functions form a complete basis, Eq.Q can be used se- 
quentially to replace all integrals by matrix multiplica- 
tions in Eq. ©. Unlike in Ref. [8j where the kernels 
S(x,y) were expanded in terms of orthonormal bases, 
here we choose to use Eq. (0} iteratively. In this way 
we reduce the number of integral equations that need to 



be solved from 4 to 2, and at the same time the ma- 
trices that need to be multiplied are lower dimensional. 
Whenever the sequence heterogeneity results in a non- 
symmetric S(x,y), Eq.Q cannot be used and we resort 
to a symmetrization technique, based on successive intro- 
duction of auxiliary integration variables, as explained in 
Ref. [13. 

We evaluate the probabilities Pk{s), for a base-pair 
opening spanning k base-pairs (our operational definition 
of a bubble of size fc), starting at base-pair s as 

roo s+fc— 1 

P k {s) = Z- 1 YL dy n Z k (s)e-P £ ^y^ (5) 

where t is the separation (which we have here taken as 
1.5 A) of the double strand above which we define the 
strand to be melted. 



III. LENGTHSCALES AND EFFECTIVE 
DENSITY APPROXIMATION. 

In Ref. Q we suggested that the probabilities of finding 
bubbles extending over n sites localized around a given 
bp, is, to a very good approximation, proportional to 
the density of soft A/T base pairs within a region of 
length L(n) centered around the same bp, an approach 
we term here as effective density approximation (EDA). 
The lengths L(n) were obtained from numerical transfer 
integral calculations of the bubble probabilities of several 
simple (but experimentally realizable) sequences. The 
A/T density profiles were therefore compared with the 
exact probabilities for thermal activation of bubbles of 
sizes n = 1 and n = 5 of a wild and a mutant version 
of the AAV P5 promoter. The agreement was excellent. 
However, no physical explanation for the origin of these 
characteristic lengths was provided nor were they con- 
nected to any intrinsic length of the PBD model. But, 
since they appear prominently in the formation of DNA 
bubbles, it is important to investigate both of these ques- 
tions. 

In Figs lll3l we consider a sequence composed of 150 
G/C +1 A/T +150 G/C. In other words, we place a 
defect (A/T instead of G/C) at the site l = 151. This 
defect is 150 bp away from the two ends of the sequence 
in order to eliminate boundary effects. 

A/T base pairs have a smaller bonding energy than 
GC bps. Therefore, the A/T defect softens a number 
of GC-bps around it and increases the opening probabil- 
ity. Clearly, sufficiently away from the defect the open- 
ing probability regains the bulk value of a homogeneous 
G/C-sequence at the given temperature, given threshold 
and given bubble size. Our claim is that the characteris- 
tic length L(n) is the distance necessary to be away from 
the defect so that the G/C bps there are no longer af- 
fected. This can be quantified by calculating the relative 
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fluctuation 

P n (lg - L(n)) - P n (110) ... 

Km = * (6) 

where Iq — L(n) is the bp site obtained counting L(n) 
downstream from the defect site, see Figs lll3l and at site 
110 we assume that the bulk value has been regained. 
The remarkable finding is that with the choice of L(n) 
considered in our previous work Q, obtained indepen- 
dently by merely fitting the full numerical TIO calcula- 
tions of the bubble formation probabilities of different 
simple sequences, we obtain from Eq.© a ~ 2.5%, inde- 
pendently from the size of the bubble and the tempera- 
ture of the DNA sequence. This can be seen in Figs lll3l 
the circle at bp = 151 = Iq is the A/T defect, while the 
circles at bp — 141, 139, 135 are the positions of the bp 
at Iq — L(n). We can therefore reverse the perspective 
and define the characteristic length as the one given by 
Eq.©, with a ~ 2.5%. This is important for pratical 
applications, since it gives a simple criterion to estimate 
bubbles probabilities for arbitrary bubble sizes and DNA 
temperatures (and arbitrary PBD inter base-pairs inter- 
action parameters), but it also immediately suggests a 
simple physical explanation for L(n). 
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FIG. 1: The probability profile for the creation of a bubble 
of size 5 bp. The isolated A/T bp embedded in a sequence 
of G/C bps at bp = 151 is denoted by a solid black circle. A 
second black circle is located at bp — 151 — L(5) = 141. The 
relative error M±Sk0im = .0232. 

We parametrize the decay of the probability values as a 
function of the downstream distance from the A/T defect 
according to 

P n (l) =A n + B n exp[- l °~ n + 1 ~ l ], (7) 

where P n {l) is the probability for finding a bubble of 
size n located at the site A n is the bulk value of the 
homogeneous G/C sequence and A n + B n is the value 
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FIG. 2: The probability for the creation of a bubble of size 7 
bp. The black circle at bp — 151 represents the defect. The 
second black circle is located at bp — 151 — 1/(7) = 139. The 
relative error P7(13 p 9 )- ff< 110 > = 0.0279. 
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FIG. 3: The probability for the creation of a bubble of size 
10 bp. As before, the defect is represented by a solid black 
circle, and a second one is located at bp — 151 — L(10) — 135. 
The relative error fl,(13 ^!» (110) = 0.0203. 



of the probability at the site Zo — tt. + 1, which is the 
same as the probability value of the defect site Zo- £« 
is the healing length of the system, namely the char- 
acteristic length for the perturbation to die out, which, 
quite generally, depends on the size n of the bubble, tem- 
perature of the DNA and the parameters of the PBD 
model. Replacing Eq.J7Jl in Eq.@), we obtain the re- 
lation L(n) = n - 1 + £, n In > wnere B n/Ai = 
(P n (l )-P n (bulk))/P n (bulk) ~0.34. We emphasize that 
both B n /A n and a are independent of the size n of the 
bubble. It follows that there is a simple linear relation 
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between the healing and characteristic lengths: 

L(n) = n- 1 + 2.6 £ n . (8) 

This is a very important result of this report. The heal- 
ing length can be easily calculated as a function of the 
bubble size and temperature with an homogeneous G/C 
sequence plus a single defect, as shown above. From this, 
we can calculate the value of L(n) and, therefore, esti- 
mate the probability for the creation of bubbles for arbi- 
trary DNA sequences at any temperature. For instance, 
for bubbles of size n = 7, we obtain L(7) = 12, while for 
bubbles of size n = 10 we have L(1Q) = 16 at T = 300 K 
and PBD parameters as in 

In order to examine how the values of the parameters 
of the PBD model affect those of L(n), we set p = and 
repeat the calculation of L(10). Since, when p decreases, 
so does the " cooperativity" of the base base pairs, one 
would expect to observe a drop in the £(10) value. This is 
indeed the case: L(10) = 14, while for p = 2 the value as 
16. We will show in the next Section how this approach 
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FIG. 4: In this figure we show the probability for the creation 
of a bubble of size 10 bp, when the coupling constant p — 
0. As is indicated by the relative error p io( 13 J)-^io( 110 ) _ 
0.0216, now 1,(10) = 14. 

compares with exact transfer integral operator calcula- 
tions of the statistical properties of the PBD model. 

We conclude this Section by noting that the Figs. [T\ 
13 exhibit symmetry, but not with respect to the defect. 
While the defect is always at bp = 151, the symmetry 
is with respect to bp 149 in the P5 case, 148 in the P7 
case, and the axis that separates bps 146-147 in the P10 
case. Another feature is the existence of a second local 
maximum with the same value as P n (151), and a slight 
drop in the probability values in the middle of the peak. 
We notice that the two maxima are located at sites Iq 
and lo — n+ 1. This suggests that a bubble with a defect 
at its boundary has a higher probability to form: in the 
P n (lo — n+1) case the defect is at the end of the bubble, 
while in the P n (lo) case it is at the begining of the bubble. 



Also, the probability drops in the middle of the peak 
because the bubble there contains a defect that is trapped 
within G/C bps, and it turns out that the probability of 
formation of a bubble of this kind is smaller. 



IV. COMPARISON OF THE EDA AND TIO 
METHOD. 

We now compare the probability profiles obtained from 
the effective density approach with the characteristic 
length L(n) calculated as in the previous Section, with 
exact results obtained with the TIO method. We con- 
sider five different human genome subsequences, and 
compare the calculations for the probability of formation 
of bubbles of sizes n = 7 and n = 10. 

In the panels (a, b) of Figs l5l9l we plot (as a function of 
the bp site) the number N7 and Niq of A/T bps calcu- 
lated over a distance L(7) = 12 (panel a) and L(1Q) = 16 
(panel b). These A/T density profiles can be compared 
with the probability for the thermal creation of bubbles 
of seven, P7, and ten, P10, sites, panels (c, d). In all cases 
(and in several other not reported here) the resemblance 
in the main features of the respective profiles is strik- 
ing. In particular, EDA correctly predicts the locations 
and relative weights of the probability peaks. The crucial 
point is that, while the profiles obtained with the EDA 
requires few seconds to be calculated, the full TIO meth- 
ods is very time consuming (of the order of several hours 
in the cases presented here) . To fully appreciate this ad- 
vantage, we note that with the EDA the entire human 
genome can be sequenced for bubble formation probabil- 
ities in few minutes, while a statistical approach based 
on the calculation of the partition function is clearly im- 
possible. 
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FIG. 5: Effective density profiles for 7 and 10-site long bub- 
bles (a,b) and probability profiles calculated with the transfer 
integral approach, (c,d). The sequence is the cox 8 promoter. 
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FIG. 6: Effective density profiles for 7 and 10-site long bub- 
bles (a,b) and probability profiles calculated with the transfer 
integral approach, (c,d). The sequence is the cox 11 promoter. 



FIG. 8: Effective density profiles for 7 and 10-site long bub- 
bles (a,b) and probability profiles calculated with the transfer 
integral approach (c,d). The sequence is the h33a promoter. 





FIG. 7: Effective density profiles for 7 and 10-site long bub- 
bles (a,b) and probability profiles calculated with the transfer 
integral approach (c,d). The sequence is the gtf2f2 promoter. 



FIG. 9: Effective density profiles for 7 and 10-site long bub- 
bles (a,b) and probability profiles calculated with the transfer 
integral approach (c,d). The sequence is the h3b promoter. 



V. CONCLUSIONS. 

It has been suggested that the DNA transcription initi- 
ation sites can coincide with the location of large bubble 
openings. A thorough investigation of this hypothesis re- 
quires the statistical analysis of many DNA promoters 
within the PBD model. Such a task becomes quickly 
prohibitive when studying bubble-promoter correlations 
in a significantly large number of cases (namely, for large 
sequences). This problem has motivated the develop- 
ment of an alternative simplified approach to calculate 
the bubble formation probabilities. We have found that 
this probabilities are proportional to the density of soft 
A/T base-pairs calculated over an effective length which 
depends on the size of the bubble and the temperature of 



the DNA. We have clarified the physical origin of such a 
length and suggested a simple procedure for its calucla- 
tions. The results of our effective density approach are 
in extremely good agreement with full exact calculations, 
but with a numerical effort reduced by several order of 
magnitudes. In this way, the full human genome can be 
analyzed, opening a unique possibility to understand the 
existence and nature of the correlations between ther- 
mally activated bubbles and promoters. 
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